The goal of this analysis is an untargeted (data-driven) multi-omic analysis of metabolomics and 16S microbiome data generated from an ex vivo fecal incubation study. In this study, broccoli sprouts (broc), brussels sprouts (brus), and a combination of the two (combo), as well as a negative control (no_veg) underwent in vitro digestion. Following in vitro digestion, the chyme was incubated with human fecal samples (n = 40) for 24 hours followed by 16S sequencing and untargeted metabolomics analysis (our paper in Nutrients https://www.mdpi.com/2072-6643/13/9/3013 for more methodological details).
Our previous work with this dataset identified two separate microbiome profiles within our subjects: some subjects had microbiomes enriched with bacteria from taxa Clostridiaceae (C_Type individuals in this analysis) while others had microbiomes enriched with bacteria from taxa Enterobacteriaceae (E_Type individuals). Individuals with microbiomes rich in members of Family Clostridiaceae were found to have a higher abundance of sulforaphane-nitrile in their digesta compared to individuals with microbiomes rich in Enterobacteriaceae. The composition of the microbiomes and a PCoA plot are shown below.
setwd('~/Documents/Projects/poopsoup/poopsoup_two/')
library(tidyverse)
library(magrittr)
library(ggfortify)
library(lme4)
library(lmerTest)
library(factoextra)
library(phyloseq)
library(cowplot)
library(plotly)
library(remMap)
library(RGCCA)
#Helper functions:
log_helper <- function(x, min.val){
log2((x + sqrt(x^2 + min.val^2))/2)
}
#Pareto Scaling:
PS_helper <- function(x){
(x - mean(x))/sqrt(sd(x, na.rm = T))
}
#Transformation Functions:
#Log Scaling:
log_transform <- function(mtb){
mtb_nz <- mtb[ ,which(apply(mtb, 2, sum) != 0)]
min.val <- min(abs(mtb_nz[mtb_nz!=0]))/10
mtb_log_trans <- apply(mtb_nz, 2, log_helper, min.val)
return(mtb_log_trans)
}
#Pareto Scaling:
pareto_scale <- function(mtb){
mtb_scaled <- apply(mtb, 2, PS_helper)
return(mtb_scaled)
}
asvtab <- readRDS('../data/microbiome/asv_tab.RDS')
taxatab <- readRDS('../data/microbiome/tax_tab.RDS')
metadata <- read.csv('../data/microbiome/microbiome_metadata.csv')
metadata %<>% mutate(group = ifelse(group == 'A', 'C_Type', 'E_type'))
#Factorize and set the levels on the metadata
metadata$treatment %<>% factor(levels = c('fecal_stock', 'no_veg', 'broc', 'brus', 'combo', 'control_digest'))
metadata$fecal_sample %<>% factor(levels = c('T5631','T5632','T6260','T6291','T4669','T1995','T5627','T5717','T5854','T6382'))
rownames(metadata) <- metadata$sample
rownames(asvtab) <-metadata$sample
ps_raw <- phyloseq(otu_table(asvtab, taxa_are_rows = FALSE),
sample_data(metadata),
tax_table(taxatab))
#Give arbitrary names to the taxa as opposed to keeping as just DNA-sequences which identify them
taxa_names(ps_raw) <- paste0("ASV", seq(ntaxa(ps_raw)))
#Fill in missing genus names:
renames <- rownames(tax_table(ps_raw)[is.na(tax_table(ps_raw)[, 'Genus'])])
taxdf <- tax_table(ps_raw)[renames,]
renamed_genus <- unname(sapply(taxa_names(taxdf), function(x) paste0('f_', taxdf[x, 'Family'], '_', x)))
tax_table(ps_raw)[renames, 'Genus'] <- renamed_genus
#Remove the control digests, these are not relevant to our analysis
ps_raw <- ps_raw %>% subset_samples(treatment != 'control_digest')
#Agglomerate to the genus level
ps_genera <- ps_raw %>% tax_glom(taxrank = "Genus")
#Remove taxa not seen more than 3 times in at least 20% of the samples
ps_counts <- ps_genera %>% filter_taxa(function(x) sum(x > 3) > (0.2*length(x)), TRUE)
#Convert from counts to relative abundance
ps_relab <- ps_counts %>% transform_sample_counts(function(x) x / sum(x) )
#Filter out low abundance (>1e-5) taxa
ps <- ps_relab %>% filter_taxa(function(x) mean(x) > 1e-5, TRUE)
ps_final <- ps %>% subset_samples(treatment != 'fecal_stock')
microdata <- cbind(sample = data.frame(sample_data(ps_final))$metabolomics_neg_sample, data.frame(otu_table(ps_final))) %>%
mutate(sample = gsub('neg', 'ms', sample))
all_alpha <- plot_bar(ps_final, fill = 'Family', x = 'treatment') +
facet_wrap(~fecal_sample, ncol = 5) +
theme_cowplot() + theme(axis.text.x = element_text(angle = 45, hjust = 0.5, size = 12),
axis.text.y = element_text(size = 12),
axis.title.x = element_blank(),
axis.title.y = element_blank()) +
ggtitle('Relative Abundance of Bacterial Taxa Following Incubation')
xaxis <- list(title = 'Treatment',automargin = TRUE)
ggplotly(all_alpha, width = 900, height = 600) %>% layout(autosize = F, xaxis = xaxis)
ordBCall <- ordinate(ps_final, method = 'PCoA', distance = 'bray')
plot_ordination(ps_final, ordBCall, color = 'treatment') +
geom_point(aes(text = paste0('Fecal Sample: ', fecal_sample, '\n',
'Treatment: ', treatment, '\n',
'Group', group)),
size = 3) +
stat_ellipse(aes(group = group, fill = group), alpha = 0.2, geom = 'polygon') +
theme_minimal_grid() +
geom_hline(yintercept = 0) +
geom_vline(xintercept = 0) +
ggtitle('Beta Diversity of All Samples')
The goal of this analysis is to identify microbial taxa and metabolomic features identified associated with cruciferous vegetable consumption. Additionally, we aim to identify more metabolites associated with two microbiome profiles identified previously.
DNA was isolated from fecal cultures (post-incubation) using a QIAamp PowerFecal DNA kit per the manufacturer’s protocol. The fecal DNA concentration was measured using a Qubit dsDNA HS assay kit. PCR was used to amplify the 16S rRNA gene at the V4V5 region, then sequenced using an Illumina MiSeq to produce a sequence library using the Earth Microbiome Project protocol. This approach yielded 300 bp, paired end amplicon sequences at a target sequencing depth of 50,000 reads per sample. The 16S amplicon sequencing was done at the Center for Quantitative Life Sciences core facilities using established methods. Data preprocessing and identification of amplicon sequence variations (ASVs) was conducted using the DADA2 pipeline, as implemented in R (v3.5). Briefly, the reads were first trimmed for read quality and filtered for two expected errors, followed by a merging of paired reads and a removal of chimeric ASVs. The taxonomy was assigned using the Silva database V132 with the naïve Bayesian classifier built into DADA2. Using phyloseq, the ASVs were first agglomerated at the genus level, reducing the number of ASVs from 3557 to 935 due to a high number of unannotated ASVs. To remove noise from the dataset, highly rare genera not seen more than three times in at least 20% of the samples and with a relative abundance less than 0.001%, were filtered out, resulting in a final dataset of 72 ASVs. Rarefaction curves, using the vegan package, were built on agglomerated and filtered data to ensure all samples were sufficiently sequenced.
Metabolites from fecal culture medium were extracted (100 μL culture/100 μL ice cold 80:20 methanol:water), mixed vigorously, and clarified by centrifugation (13,000× g for 10 min). The supernatants were further diluted 1:10 with ice cold 80:20 methanol:water (v/v) and transferred to mass spectrometry (MS) vials. SFN-NIT and IBN-NIT were detected using LC-MS/MS in positive ion mode, as previously described. Briefly, HPLC was performed on a Shimadzu Nexera system with a phenyl-3 stationary phase column (Inertsil Phenyl-3, 5 µm, 4.6 × 150 mm) coupled to a quadrupole time-of-flight MS (AB SCIEX TripleTOF 5600). The samples were randomized, auto-calibration was performed every two samples, and a quality control sample, composed of a pooled aliquot from each sample, was analyzed every 10 samples. MS/MS information was obtained for all samples using information dependent acquisition (IDA), while sequential window acquisition of all theoretical spectra (SWATH) was performed only on quality control samples. Spectral data were processed using Progenesis QI (NonLinear Dynamics v2.4). Peak deconvolution for [M + H]+, [M + Na]+, and [M + NH4]+ adducts in positive ionization mode, and [M−H]-, [M + FA-H]-, and [M−H2O−H]- in negative ionization mode was performed in Progenesis QI. Additionally, feature intensities were normalized in Progenesis QI across samples by total ion current of all features. Metabolomic features with a CV greater than 50 in QC samples were removed from analysis due to high technical variation. The resulting data matrix was exported as a CSV for downstream analysis in R.
#Full datasets from Progenesis:
posugly <- read_csv('./data/raw_progenesis/R_Friendly/pos_metabolome_full.csv')
negugly <- read_csv('./data/raw_progenesis/R_Friendly/neg_metabolome_full.csv')
#Metadata
mdata_neg <- read_csv('~/Documents/Projects//poopsoup/data/metabolomics/neg/metadata_neg.csv') %>%
mutate(group = ifelse(group == 'A', 'C_Type', 'E_type'))
mdata_pos <- read_csv('~/Documents/Projects/poopsoup/data/metabolomics/pos/metadata_pos.csv') %>%
mutate(group = ifelse(group == 'A', 'C_Type', 'E_type'))
#Remove features which have a CV greater than 50:
pos_full <- posugly %>%
filter(is.na(cv_g_50)) %>%
select(compound, starts_with('pos')) %>%
mutate(compound = paste0(compound, '_pos')) %>%
pivot_longer(starts_with('pos'), names_to = 'sample', values_to = 'intensity') %>%
pivot_wider(names_from = 'compound', values_from = 'intensity')
neg_full <- negugly %>%
filter(is.na(cv_g_50)) %>%
select(compound, starts_with('neg')) %>%
mutate(compound = paste0(compound, '_neg')) %>%
pivot_longer(starts_with('neg'), names_to = 'sample', values_to = 'intensity') %>%
pivot_wider(names_from = 'compound', values_from = 'intensity')
We will start by transforming our data prior to analysis. First we will log-transform our data followed by Pareto-scaling. Pareto-scaling reduces the relative importance of large values while keeping the data structure partially in tact. Since Pareto-scaling is sensitive to large fold-changes in the data, log_2 -transformation is first applied. To handle 0s, in the data, the following log-transformation is applied: log2(x) = log2(x + sqrt(x^2 + a^2))/2 where a is the minimum non-0 value in the data.
#Scale our data:
pos_scaled <- pos_full %>%
column_to_rownames('sample') %>%
log_transform() %>%
pareto_scale()%>%
as.data.frame()
neg_scaled <- neg_full %>%
column_to_rownames('sample') %>%
log_transform() %>%
pareto_scale()%>%
as.data.frame()
#Generate the metadata to be in the same order as our metabolomics data
mpos <- left_join(data.frame(sample = rownames(pos_scaled)), mdata_pos)
mneg <- left_join(data.frame(sample = rownames(neg_scaled)), mdata_neg)
First we will look at the structure of our metabolomics data using a PCA analysis.
pca_pos <- prcomp(pos_scaled, center = TRUE)
autoplot(pca_pos, data = mpos, colour = 'treatment', shape = 'group', size = 3) + ggtitle('Positive Ionization Mode') +
stat_ellipse(aes(group = treatment, color = treatment, fill = treatment), geom = 'polygon', alpha = 0.2) +
geom_hline(yintercept = 0) +
geom_vline(xintercept = 0) +
cowplot::theme_minimal_grid()
pca_neg <- prcomp(neg_scaled, center = TRUE)
autoplot(pca_neg, data = mneg, colour = 'treatment', shape = 'group', size = 3) + ggtitle('Negative Ionization Mode') +
stat_ellipse(aes(group = treatment, color = treatment, fill = treatment), geom = 'polygon', alpha = 0.2) +
geom_hline(yintercept = 0) +
geom_vline(xintercept = 0) +
cowplot::theme_minimal_grid()
Unsurprisingly, no_veg, the negative control, appears to be the most different in our dataset but the other groups don’t seem to separate much. For negative ionization mode our dataset contains 11,584 features and for positive ionization mode our dataset contains 16,420 features (It is expected that positive ionization mode will contain a greater number of features due to a higher signal intensity compared to negative ionization mode (i.e. technical differences)). The high dimensionality our dataset may pose a problem for later analyses. Additionally, some features are redundant between our treatment groups, for example bile acids (a component of the in vitro digestion), which are present in the control digest and all other digests. To reduce the dimensionality of our dataset and remove noise, we only want to consider features of interest for our downstream multiomic integration. Additionally, we want to highlight two subsets of features, those which are differentially abundant between the different treatment groups, and those which are differentially abundant between our C_type and E_type individuals.
pos_prog <- read_csv('~/Documents/Projects/poopsoup/poopsoup_two/data/raw_progenesis/R_Friendly/pos_metabolome_full.csv')
neg_prog <- read_csv('~/Documents/Projects/poopsoup/poopsoup_two/data/raw_progenesis/R_Friendly/neg_metabolome_full.csv')
prog_sig_pos <- pos_prog %>%
filter(!is.na(q_u_05)) %>%
dplyr::select(compound, starts_with('pos')) %>%
mutate(compound = paste0(compound, '_pos')) %>%
pivot_longer(starts_with('pos'), names_to = 'sample', values_to = 'intensity') %>%
pivot_wider(names_from = 'compound', values_from = 'intensity')
prog_sig_neg <- neg_prog %>%
filter(!is.na(q_u_05)) %>%
dplyr::select(compound, starts_with('neg')) %>%
mutate(compound = paste0(compound, '_neg')) %>%
pivot_longer(starts_with('neg'), names_to = 'sample', values_to = 'intensity') %>%
pivot_wider(names_from = 'compound', values_from = 'intensity')
rmanova_sig_neg <- colnames(prog_sig_neg)[-1]
rmanova_sig_pos <- colnames(prog_sig_pos)[-1]
To determine the impact of treatment (no_veg, broc, brus, combo) on the metabolomic profile of the fecal cultures, a repeated measures ANOVA (RMANOVA) was conducted using Progenesis QI software. The hyperbolic arcsine (arcsinh) transformation was utilized prior to the RMANOVA being conducted. The False Discovery Rate (FDR) was controlled for at 5% (significance was determined as q < 0.05). In positive ionization mode 3766 features were found to be significantly different between treatment groups and in negative ion mode 2152 features were found to be significantly different between treatment groups.
Next we want to find features significantly different between the two microbiome profiles observed in our fecal cultures. To do so, we will be using a generalized linear mixed model (GLMM). The no_veg group has an unequal variance from the other three treatment groups, so instead of running a model with all four groups, one model containing the three vegetable groups (broc , brus, combo) will be run and a control model containing no_veg will be run separately then compared against it. LmerTest will be used to construct contrasts to test our model and our model will be built using the formula: intensity ~ treatment + group + treatment*group + (1|fecal_sample)
library(lmerTest)
lmer_data_pos <- mdata_pos %>%
select(sample, treatment, fecal_sample, group) %>%
right_join(pos_scaled %>% rownames_to_column('sample')) %>%
filter(treatment != 'no_veg') %>% #Remove from analysis as it has different variance from the rest of the samples
column_to_rownames('sample') %>%
select(where(is.numeric)) %>%
select(which(apply(., 2, sd) != 0)) %>% #Remove samples with 0 variance
rownames_to_column('sample') %>%
left_join(mdata_pos %>% select(sample, treatment, fecal_sample, group)) %>%
pivot_longer(cols = where(is.numeric), names_to = 'feature', values_to = 'intensity')
lmer_pos <- lmer_data_pos %>%
group_by(feature) %>%
nest() %>% # makes tibble of data for each feature, saves as 'lmer' column
mutate_at( # change the data in 'lmer' column
"data",
purrr::map, # allows iteration over data in 'lmer' column, returns list
function(x) {
lmer <- lmerTest::lmer(intensity ~ treatment + group + treatment * group + (1 |fecal_sample), data = x) # do the lmer test
c(lmer, x) # return results of lmer, as well as data for (r)anova later
}
)
broc_pos <- lmer_pos %>%
group_by(feature) %>%
group_modify(~ {
contest(.x$data[[1]][[1]], c(0,0,0,1,0,0), joint = FALSE)
}) %>%
set_colnames(c('feature', 'estimate', 'stderror', 'df', 'tvalue', 'lower', 'upper', 'pvalue')) %>%
inset('term', value = 'broc') %>%
ungroup()
brus_pos <- lmer_pos %>%
group_by(feature) %>%
group_modify(~ {
contest(.x$data[[1]][[1]], c(0,0,0,1,1,0), joint = FALSE)
}) %>%
set_colnames(c('feature', 'estimate', 'stderror', 'df', 'tvalue', 'lower', 'upper', 'pvalue')) %>%
inset('term', value = 'brus') %>%
ungroup()
combo_pos <- lmer_pos %>%
group_by(feature) %>%
group_modify(~ {
contest(.x$data[[1]][[1]], c(0,0,0,1,0,1), joint = FALSE)
}) %>%
set_colnames(c('feature', 'estimate', 'stderror', 'df', 'tvalue', 'lower', 'upper', 'pvalue')) %>%
inset('term', value = 'combo') %>%
ungroup()
control_data_pos <- mdata_pos %>%
select(sample, treatment, fecal_sample, group) %>%
right_join(pos_scaled %>% rownames_to_column('sample')) %>%
filter(treatment != 'no_veg') %>% #Remove from analysis as it has different variance from the rest of the samples
column_to_rownames('sample') %>%
select(where(is.numeric)) %>%
select(which(apply(., 2, sd) != 0)) %>% #Remove samples with 0 variance
rownames_to_column('sample') %>%
left_join(mdata_pos %>% select(sample, treatment, fecal_sample, group)) %>%
pivot_longer(cols = where(is.numeric), names_to = 'feature', values_to = 'intensity')
control_model_pos <- control_data_pos %>%
group_by(feature) %>%
nest() %>% # makes tibble of data for each feature, saves as 'lmer' column
mutate_at( # change the data in 'lmer' column
"data",
purrr::map, # allows iteration over data in 'lmer' column, returns list
function(x) {
lmr <- lm(intensity ~ group, data = x) %>% # do the lmer test
summary() %>%
broom::tidy()
lmr[2,] # return results of lmer, as well as data for (r)anova later
}
) %>%
unnest('data') %>%
rename('pval' = p.value) %>%
mutate(padj = p.adjust(pval, method = 'BH')) %>%
select(feature, pval, padj)
sig_control_pos <- control_model_pos %>%
filter(padj <= 0.05) %>%
pull(feature)
lmer_data_neg <- mdata_neg %>%
select(sample, treatment, fecal_sample, group) %>%
right_join(neg_scaled %>% rownames_to_column('sample')) %>%
filter(treatment != 'no_veg') %>% #Remove from analysis as it has different variance from the rest of the samples
column_to_rownames('sample') %>%
select(where(is.numeric)) %>%
select(which(apply(., 2, sd) != 0)) %>% #Remove samples with 0 variance
rownames_to_column('sample') %>%
left_join(mdata_neg %>% select(sample, treatment, fecal_sample, group)) %>%
pivot_longer(cols = where(is.numeric), names_to = 'feature', values_to = 'intensity')
lmer_neg <- lmer_data_neg %>%
group_by(feature) %>%
nest() %>% # makes tibble of data for each feature, saves as 'lmer' column
mutate_at( # change the data in 'lmer' column
"data",
purrr::map, # allows iteration over data in 'lmer' column, returns list
function(x) {
lmer <- lmerTest::lmer(intensity ~ treatment + group + treatment * group + (1 |fecal_sample), data = x) # do the lmer test
c(lmer, x) # return results of lmer, as well as data for (r)anova later
}
)
broc_neg <- lmer_neg %>%
group_by(feature) %>%
group_modify(~ {
contest(.x$data[[1]][[1]], c(0,0,0,1,0,0), joint = FALSE)
}) %>%
set_colnames(c('feature', 'estimate', 'stderror', 'df', 'tvalue', 'lower', 'upper', 'pvalue')) %>%
inset('term', value = 'broc') %>%
ungroup()
brus_neg <- lmer_neg %>%
group_by(feature) %>%
group_modify(~ {
contest(.x$data[[1]][[1]], c(0,0,0,1,1,0), joint = FALSE)
}) %>%
set_colnames(c('feature', 'estimate', 'stderror', 'df', 'tvalue', 'lower', 'upper', 'pvalue')) %>%
inset('term', value = 'brus') %>%
ungroup()
combo_neg <- lmer_neg %>%
group_by(feature) %>%
group_modify(~ {
contest(.x$data[[1]][[1]], c(0,0,0,1,0,1), joint = FALSE)
}) %>%
set_colnames(c('feature', 'estimate', 'stderror', 'df', 'tvalue', 'lower', 'upper', 'pvalue')) %>%
inset('term', value = 'combo') %>%
ungroup()
control_data_neg <- mdata_neg %>%
select(sample, treatment, fecal_sample, group) %>%
right_join(neg_scaled %>% rownames_to_column('sample')) %>%
filter(treatment != 'no_veg') %>% #Remove from analysis as it has different variance from the rest of the samples
column_to_rownames('sample') %>%
select(where(is.numeric)) %>%
select(which(apply(., 2, sd) != 0)) %>% #Remove samples with 0 variance
rownames_to_column('sample') %>%
left_join(mdata_neg %>% select(sample, treatment, fecal_sample, group)) %>%
pivot_longer(cols = where(is.numeric), names_to = 'feature', values_to = 'intensity')
control_model_neg <- control_data_neg %>%
group_by(feature) %>%
nest() %>% # makes tibble of data for each feature, saves as 'lmer' column
mutate_at( # change the data in 'lmer' column
"data",
purrr::map, # allows iteration over data in 'lmer' column, returns list
function(x) {
lmr <- lm(intensity ~ group, data = x) %>% # do the lmer test
summary() %>%
broom::tidy()
lmr[2,] # return results of lmer, as well as data for (r)anova later
}
) %>%
unnest('data') %>%
rename('pval' = p.value) %>%
mutate(padj = p.adjust(pval, method = 'BH')) %>%
select(feature, pval, padj)
sig_control_neg <- control_model_neg %>%
filter(padj <= 0.05) %>%
pull(feature)
unique_neg <- rbind(broc_neg, brus_neg, combo_neg) %>%
mutate(padj = p.adjust(pvalue, method = 'BH')) %>%
dplyr::filter(padj <= 0.05) %>%
filter(!feature %in% sig_control_neg)
unique_pos <- rbind(broc_pos, brus_pos, combo_pos) %>%
mutate(padj = p.adjust(pvalue, method = 'BH')) %>%
dplyr::filter(padj <= 0.05) %>%
filter(!feature %in% sig_control_pos)
final_neg <- rbind(broc_neg, brus_neg, combo_neg) %>%
mutate(padj = p.adjust(pvalue, method = 'BH')) %>%
dplyr::filter(padj <= 0.05)
final_pos <- rbind(broc_pos, brus_pos, combo_pos) %>%
mutate(padj = p.adjust(pvalue, method = 'BH')) %>%
dplyr::filter(padj <= 0.05)
glmm_sig_neg <- unique(final_neg$feature)
glmm_sig_pos <- unique(final_pos$feature)
#save(glmm_sig_neg, glmm_sig_pos, file = 'glmm_out.RData')
After comparing the treatment model to the control model, we find 2 features in positive ion mode and 3 features in negative ion mode that are significantly different in abundance between the two microbiome profiles in our individuals. Due to the low number of these findings, we will also include features that were also found to be significantly different in the control model taking our total number of significantly different features in positive and negative ion modes to 116 features and 136 features, respectively.
neg_ft <- unique(c(glmm_sig_neg, rmanova_sig_neg)) #2281 Features
pos_ft <- unique(c(glmm_sig_pos, rmanova_sig_pos)) #3871 Features
reduced_set_neg <- neg_scaled[ ,colnames(neg_scaled) %in% neg_ft]
reduced_set_pos <- pos_scaled[ ,colnames(pos_scaled) %in% pos_ft]
Lastly we combine the findings of our RMANOVA and GLMM to produce our final feature list which will be used for the multiomic analyses. In negative ion mode 6 features overlap between the GLMM and RMANOVA results while in positive ion mode 10 features overlap between the two modes. Our final datasets are 2281 features and 3871 features large for negative and positive ion mode, respectively.
Before moving forward to our multi-omic analyses, let’s construct our PCAs again to see how reducing our features as impacted the results.
pca_pos_re <- prcomp(reduced_set_pos, center = TRUE)
autoplot(pca_pos_re, data = mpos, colour = 'treatment', shape = 'group', size = 3) +
ggtitle('Positive Ionization Mode - Filtered') +
stat_ellipse(aes(group = treatment, color = treatment, fill = treatment), geom = 'polygon', alpha = 0.2) +
geom_hline(yintercept = 0) +
geom_vline(xintercept = 0) +
cowplot::theme_minimal_grid()
pca_neg_re <- prcomp(reduced_set_neg, center = TRUE)
autoplot(pca_neg_re, data = mneg, colour = 'treatment', shape = 'group', size = 3) +
ggtitle('Negative Ionization Mode - Filtered') +
stat_ellipse(aes(group = treatment, color = treatment, fill = treatment), geom = 'polygon', alpha = 0.2) +
geom_hline(yintercept = 0) +
geom_vline(xintercept = 0) +
cowplot::theme_minimal_grid()
combined_set <- cbind(reduced_set_neg, reduced_set_pos)
rownames(combined_set) <- gsub('neg', 'ms', rownames(combined_set))
mcom <- mneg %>%
mutate(sample = gsub('neg', 'ms', sample))
pca_com_re <- prcomp(combined_set, center = TRUE)
autoplot(pca_com_re, data = mcom, colour = 'treatment', shape = 'group', size = 3) + ggtitle('Combined Filtered Set') +
stat_ellipse(aes(group = treatment, color = treatment, fill = treatment), geom = 'polygon', alpha = 0.2) +
geom_hline(yintercept = 0) +
geom_vline(xintercept = 0) +
cowplot::theme_minimal_grid()
Unsurprisingly, on PC1 we see no_veg pulling away from the broccoli groups and a relatively large amount of the total variation compared to the full dataset (~35% vs ~15%) is now attributed to a treatment effect. PC2 looks like a small amount of separation is occurring by group, but it is not super convincing.
Lastly, we have combined our positive and negative ion modes to create a single dataset that is reflective of the entire metabolome.Work is ongoing to use a tool called MSCombine that will help remove redundancies from these sets.
Is reducing the number of features for future analyses acceptable or should I use the entire dataset? I am worried that due to the extremely high dimensionality of the combined, unfiltered dataset (~28,000 features) there will be too much noise to find any prominent signals. Most multi-omic papers that I have seen (and statistical method papers) seem to narrow down their features to a select set, typically based on biological information. Since the identified of many of our metabolomic features is unknown and I am trying to take a data-driven approach to this analysis, this is a bit hard, thus, why I am trying to use statistical methods to select metabolomic features of interest that may play a role in metabolism
Our first multiomic technique that we will apply is a simple marginal correlation analysis. For this we will simply run a spearman’s correlation between ASVs and metabolomic features. No grouping nor treatment information will be given.
multiset_wide <- combined_set %>%
rename_with(~paste0('FT_', .x)) %>%
rownames_to_column('sample') %>%
left_join(microdata)
multiset_tidy <- multiset_wide %>%
pivot_longer(cols = starts_with('FT'), values_to = 'intensity', names_to = 'feature') %>%
pivot_longer(cols = starts_with('ASV'), values_to = 'relab', names_to = 'ASV') %>%
left_join(mcom)
spm_out <- multiset_tidy %>%
group_by(feature, ASV) %>%
nest() %>%
mutate(spm = map(data, function(x) cor.test(x$relab, x$intensity, method = 'spearman')))
spearman <- spm_out %>%
ungroup() %>%
mutate(rho = map_dbl(spm, function(x) x$estimate)) %>%
mutate(pval = map_dbl(spm, function(x) x$p.value)) %>%
mutate(padj = p.adjust(pval, method = 'BH'))
spm_ft <- spearman %>%
filter(padj <= 0.05) %>%
pull(feature) %>%
unique()
spm_asv <- spearman %>%
filter(padj <= 0.05) %>%
pull(ASV) %>%
unique()
Our spearman test evaluated each feature and ASV in a pairwise manner resulting in 442,944 tests. Of these tests, 11,440 came back as significant (2.6%). 4335 metabolomic features were found to have a significant correlation with at least 1 ASV, accounting for 70.5% of the metabolome. Similarly, 70 ASVs were found to have a significant correlation with at least 1 metabolomic feature accounting for 97% of the ASVs.
Next we will utilize dimension reduction techniques to analyze our data. Specifically, we will be using a Sparse Generalized Canonical Correlation Analysis (SGCCA). The sparsity parameter for each of our models is the optimal (calculated by the package using the Schafer and Strimmer formula) shrinkage parameter from the corresponding RGCCA (regularized general canonical correlation) model. For each block, 5 components will be built and the components which offer the best separation of both treatment and microbiome profile will be selected.
ft_block <- multiset_wide %>%
dplyr::select(starts_with('FT'))
asv_block <- multiset_wide %>%
dplyr::select(starts_with('ASV'))
#Recreate the metadata to make sure that they are in the same order as the features
multiset_meta <- multiset_wide %>%
dplyr::select('sample') %>%
left_join(mcom)
group_block <- multiset_meta %>%
dplyr::select(sample, group) %>%
mutate(group = ifelse(group == 'C_Type', 1, 0)) %>%
dplyr::select(group)
treatment_block <- multiset_meta %>%
dplyr::select(sample, treatment) %>%
mutate(trt_broc = ifelse(treatment == 'broc', 1, 0)) %>%
mutate(trt_brus = ifelse(treatment == 'brus', 1, 0)) %>%
mutate(trt_combo = ifelse(treatment == 'combo', 1, 0)) %>%
dplyr::select(starts_with('trt'))
Our first SGCCA analysis is akin to a multi-block PCA and will be used to identify metabolomic features and ASVs associated with the different treatment groups and the microbial profiles we observed in our individuals. The superblock contains a concatenation of the metabolomic features and the microbial ASVs. The design of the model is as follows:
superblock <- cbind(asv_block, ft_block)
A <- list(asv_block, ft_block, treatment_block, group_block, superblock)
#A,F,T,G,S
C <- matrix(c(0,1,1,1,1,#A
1,0,1,1,1,#F
1,1,0,1,1,#T
1,1,1,0,1,#G
1,1,1,1,0), 5, 5)
mod1_params <- rgcca(A, C, tau = 'optimal', ncomp = c(5,5,1,1,5), scheme = 'factorial', scale = TRUE, verbose = FALSE)
mod1 <- sgcca(A, C, c1 = c(mod1_params$tau[1,1], mod1_params$tau[1,2], 1, 1, mod1_params$tau[1,5]),
ncomp = c(5,5,1,1,5), scheme = 'factorial', scale = TRUE, verbose = FALSE)
df1 <- data.frame(Trt = multiset_meta$treatment,
Grp = multiset_meta$group,
Subject = multiset_meta$fecal_sample,
ASV1 = mod1$Y[[1]][,1],
ASV2 = mod1$Y[[1]][,2],
ASV3 = mod1$Y[[1]][,3],
FT1 = mod1$Y[[2]][,1],
FT2 = mod1$Y[[2]][,2],
FT3 = mod1$Y[[2]][,3],
SB1 = mod1$Y[[5]][,1],
SB2 = mod1$Y[[5]][,2],
SB3 = mod1$Y[[5]][,3],
SB4 = mod1$Y[[5]][,4],
SB5 = mod1$Y[[5]][,5])
ggplot(df1, aes(x = SB1, y = SB2, color = Trt)) +
geom_point(aes(shape = Grp), size = 3) +
geom_vline(xintercept = 0) +
geom_hline(yintercept = 0) +
stat_ellipse(aes(group = Trt, fill = Trt), geom = 'polygon', alpha = 0.1) +
theme(legend.position="bottom", legend.box = "horizontal",
legend.title = element_blank()) +
ggtitle('Model 1') +
theme_minimal_grid() +
xlab('Superblock Component 1') +
ylab('Superblock Component 2')
With the exception of a few individuals, the SGCCA did a good job at separating individuals by both treatment and microbiome profile. We will now dig into the variables using a correlation circle to identify features related to each metadata item. Features with a positive loading for superblock component 1 are associated with C-Type individuals while features with a negative loading are associated with E-Type individuals. Similarly, on Component 2 negative loadings are associated with the negative control treatment group while positive loadings are associated with the combo treatment group. When plotted together, features which lie close to each other on correlation circle are associated with one another. For now, I am not plotting the correlation circle but instead pulling out features which have high loadings to examine.
mod1_ft <- mod1$a[[5]][mod1$a[[5]][,1] != 0 | mod1$a[[5]][,2] != 0, ] %>%
as.data.frame() %>%
dplyr::select(V1, V2) %>%
rownames_to_column('feature') %>%
rename('comp1' = V1, 'comp2' = V2) %>%
mutate(type = ifelse(str_detect(feature, 'ASV'), 'ASV', 'FT'))
mod1_ft %>%
dplyr::filter(type == 'ASV')
## feature comp1 comp2 type
## 1 ASV1 -0.07557776 0.000000000 ASV
## 2 ASV4 0.07452644 0.000000000 ASV
## 3 ASV354 0.00000000 0.003486842 ASV
## 4 ASV1584 0.00000000 0.053111178 ASV
In the ASV dimension regularization resulted in 4 features not being shrunk to 0. This relatively small number of ASVs (5% of total ASVs) is not hugely surprising as our previous analysis of the microbiome showed that treatment had no effect on microbiome composition. ASV1 has a loading of -0.076 on component 1 and ASV4 has a loading of 0.075 on component 1. Unsurprisingly ASV1 is from family Enterobacteriaceae and ASV4 is from family Clostridiaceae, validating that our analysis is behaving as expected. For component 2, ASV354 had a very weak positive loading of 0.0035 and ASV1584 had a loading of 0.053. ASV1584 is from Genus Eggerthellaceae of family Coriobacteriales. Previous work in humans in humans has shown that cruciferous vegetable consumption is associated with an increase in bacteria from genus Eggerthellaceae. Additionally, kale consumption increased the abundance of bacteria from family Coriobacteriales in a diet-induced obesity mouse model.
For metabolomic features, 706 total features were retained by the model (11% of total features). Work is currently ongoing to identify the metabolomic features, however, to help validate our findings we can plot the intensity of each feature across the different treatment groups and the different microbial profiles. We will start with relatively strong loadings from component 1, which should show an association with a microbiome profile.
ftset <- mod1_ft %>% filter(type == 'FT')
cp1_top <- ftset %>%
dplyr::arrange(abs(comp1)) %>%
tail(n = 10) %>%
mutate(signal = ifelse(comp1 > 0, 'C-Type', 'E-Type'))
cp2_top <- ftset %>%
dplyr::arrange(abs(comp2)) %>%
tail(n = 20) %>%
mutate(signal = ifelse(comp2 > 0, 'combo', 'noveg'))
combined_tidy <- combined_set %>%
rename_with(~paste0('FT_', .x)) %>%
rownames_to_column('sample') %>%
pivot_longer(cols = starts_with('FT'), values_to = 'intensity', names_to = 'feature') %>%
left_join(multiset_meta)
ft_comp1 <- combined_tidy %>%
dplyr::filter(feature %in% cp1_top$feature[1:6]) %>%
left_join(cp1_top)
ggplot(ft_comp1, aes(x = group, y = intensity, color = signal)) +
geom_point() +
stat_summary(fun = 'mean', geom = 'point', color = 'black', size = 4, shape = 18) +
facet_wrap(~feature, ncol = 3)
Overall, it looks like the features of interest are behaving as expected and the mean of each group (black diamonds) is higher for metabolites associated with one microbial profile or the other (as designated by color). Let’s perform the same kind of sanity check for features on component 2 which should be associated with different treatment groups.
ft_comp2 <- combined_tidy %>%
dplyr::filter(feature %in% cp2_top$feature[c(18:20,2:4)]) %>%
left_join(cp2_top)
ggplot(ft_comp2, aes(x = treatment, y = intensity, color = signal)) +
geom_point() +
stat_summary(fun = 'mean', geom = 'point', color = 'black', size = 4, shape = 18) +
facet_wrap(~feature, ncol = 3)
For these features, the signal definitely appears stronger for the combo-associated features. For no-veg associated features, it appears that no-veg is marginally higher and that combo is typically the lowest abundance. Once these features are identified in Progenesis we will be better able to evaluate these findings.
Are these findings invalid because the metabolomics dataset was already primed for separation by only including features found to be significantly different between the two microbiome groups and between the treatment groups?
Next we will run our SGCCA model again, but this time using it attempt to discriminate between the treatment groups and the microbiome profiles. First we will run a model with the goal to discriminate between microbial profiles. The model design is as follows:
A2 <- list(asv_block, ft_block, superblock, group_block)
#A,F,S,G
C2 <- matrix(c(0,1,1,0,#A
1,0,1,0,#F
1,1,0,1,#S
0,0,1,0), 4, 4)
mod2_params <- rgcca(A2, C2, tau = 'optimal', ncomp = c(5,5,5,1), scheme = 'factorial', scale = TRUE, verbose = FALSE)
mod2 <- sgcca(A2, C2, c1 = c(mod2_params$tau[1,1], mod2_params$tau[1,2], mod2_params$tau[1,3], 1),
ncomp = c(5,5,5,1), scheme = 'factorial', scale = TRUE, verbose = FALSE)
df2 <- data.frame(Trt = multiset_meta$treatment,
Grp = multiset_meta$group,
Subject = multiset_meta$fecal_sample,
ASV1 = mod2$Y[[1]][,1],
ASV2 = mod2$Y[[1]][,2],
ASV3 = mod2$Y[[1]][,3],
FT1 = mod2$Y[[2]][,1],
FT2 = mod2$Y[[2]][,2],
FT3 = mod2$Y[[2]][,3],
SB1 = mod2$Y[[3]][,1],
SB2 = mod2$Y[[3]][,2],
SB3 = mod2$Y[[3]][,3],
SB4 = mod2$Y[[3]][,4],
SB5 = mod2$Y[[3]][,5])
ggplot(df2, aes(x = SB1, y = SB2, color = Trt)) +
geom_point(aes(shape = Grp), size = 3) +
geom_vline(xintercept = 0) +
geom_hline(yintercept = 0) +
stat_ellipse(aes(group = Grp, fill = Grp), geom = 'polygon', alpha = 0.4) +
theme(legend.position="bottom", legend.box = "horizontal",
legend.title = element_blank()) +
ggtitle('Model 2') +
theme_minimal_grid() +
xlab('Superblock Component 1') +
ylab('Superblock Component 2')
We see pretty clear clustering occuring across Component 1, let’s go thru and pull our highest loaded features again.
mod2_ft <- mod2$a[[3]][mod2$a[[3]][,1] != 0, ] %>%
as.data.frame() %>%
dplyr::select(V1, V2) %>%
rownames_to_column('feature') %>%
rename('comp1' = V1, 'comp2' = V2) %>%
mutate(type = ifelse(str_detect(feature, 'ASV'), 'ASV', 'FT'))
mod2_ft %>%
dplyr::filter(type == 'ASV')
## feature comp1 comp2 type
## 1 ASV1 -0.07599740 0.00000000 ASV
## 2 ASV4 0.07529448 -0.05267909 ASV
For ASVs, we are only getting ASVs 1 and 4 out, and they are behaving as expected.
Of the 266 metabolomic features that the model identified, 230 of them were also found to be significantly different between the microbial profiles by our GLMM. This makes up 86% of the features from the SGCCA and 91% of the features from our GLMM. Let’s focus in on the features not found from the GLMM:
grp_ft <- mod2_ft[!mod2_ft$feature %in% c(paste0('FT_', glmm_sig_neg), paste0('FT_', glmm_sig_pos)),] %>%
filter(type == 'FT') %>%
arrange(abs(comp1)) %>%
tail(n = 20) %>%
mutate(signal = ifelse(comp1 > 0, 'C-Type', 'E-Type'))
ft_comp2 <- combined_tidy %>%
dplyr::filter(feature %in% grp_ft$feature[c(18:20,17,13,2)]) %>%
left_join(grp_ft)
ggplot(ft_comp2, aes(x = group, y = intensity, color = signal)) +
geom_point() +
stat_summary(fun = 'mean', geom = 'point', color = 'black', size = 4, shape = 18) +
facet_wrap(~feature, ncol = 3)
Once again, the means of each group are higher in microbial group they correspond to.
Let’s now run the same type of model, but this time trying to discriminate by treatment. The model design is as follows:
A3 <- list(asv_block, ft_block, superblock, treatment_block)
#A,F,S,G
C3 <- matrix(c(0,1,1,0,#A
1,0,1,0,#F
1,1,0,1,#S
0,0,1,0), 4, 4)
mod3_params <- rgcca(A3, C3, tau = 'optimal', ncomp = c(5,5,5,1), scheme = 'factorial', scale = TRUE, verbose = FALSE)
mod3 <- sgcca(A3, C3, c1 = c(mod3_params$tau[1,1], mod3_params$tau[1,2], mod3_params$tau[1,3], 1),
ncomp = c(5,5,5,1), scheme = 'factorial', scale = TRUE, verbose = FALSE)
df3 <- data.frame(Trt = multiset_meta$treatment,
Grp = multiset_meta$group,
Subject = multiset_meta$fecal_sample,
ASV1 = mod3$Y[[1]][,1],
ASV2 = mod3$Y[[1]][,2],
ASV3 = mod3$Y[[1]][,3],
FT1 = mod3$Y[[2]][,1],
FT2 = mod3$Y[[2]][,2],
FT3 = mod3$Y[[2]][,3],
SB1 = mod3$Y[[3]][,1],
SB2 = mod3$Y[[3]][,2],
SB3 = mod3$Y[[3]][,3],
SB4 = mod3$Y[[3]][,4],
SB5 = mod3$Y[[3]][,5])
ggplot(df3, aes(x = SB1, y = SB4, color = Trt)) +
geom_point(aes(shape = Grp), size = 3) +
geom_vline(xintercept = 0) +
geom_hline(yintercept = 0) +
stat_ellipse(aes(group = Trt, fill = Trt), geom = 'polygon', alpha = 0.4) +
theme(legend.position="bottom", legend.box = "horizontal",
legend.title = element_blank()) +
ggtitle('Model 3') +
theme_minimal_grid() +
xlab('Superblock Component 1') +
ylab('Superblock Component 4')
For model 3, on the 4th component we saw good separation all 4 groups, specifically broc and brus which we have not achieved before. Like before, we will examin high loading features.
mod3_ft <- mod3$a[[3]][mod3$a[[3]][,1] != 0 | mod3$a[[3]][,4] != 0, ] %>%
as.data.frame() %>%
dplyr::select(V1, V4) %>%
rownames_to_column('feature') %>%
rename('comp1' = V1, 'comp4' = V4) %>%
mutate(type = ifelse(str_detect(feature, 'ASV'), 'ASV', 'FT'))
mod3_ft %>%
dplyr::filter(type == 'ASV')
## feature comp1 comp4 type
## 1 ASV291 0.00000000 0.043632520 ASV
## 2 ASV415 0.00000000 0.001345626 ASV
## 3 ASV1487 0.00000000 -0.039269368 ASV
## 4 ASV1584 0.02146219 0.000000000 ASV
## 5 ASV1609 0.00000000 -0.008441891 ASV
## 6 ASV1736 0.00000000 -0.030990774 ASV
## 7 ASV2581 0.00000000 -0.017354076 ASV
We see 7 ASVs coming out of our model this time. On component 1 we only see ASV1584 as not being shrunk to 0 and it is associated with the combo which is expected. ASV291 corresponds to the Lachnospiraceae family was found to be associated with broccoli and studies in rats have shown that it decreases with cruciferous vegetable consumption while studies in mice shows that it increases. ASV1736 which corresponds to genus Oscillibacter and has been shown to increase with cruciferous vegetable consumption in rats and was found here to be associated with brussels sprouts. While a more comprehensive literature search is needed, the fact that known and previously observed ASVs are popping up seems promising.
Switching gears to metabolomic features, we see 925 metabolomic features coming back with non-zero loadings.
trt_ft_nc <- mod3_ft %>%
filter(type == 'FT') %>%
arrange(abs(comp1)) %>%
tail(n = 30) %>%
mutate(signal = ifelse(comp1 > 0, 'combo', 'noveg'))
nc_comp <- combined_tidy %>%
dplyr::filter(feature %in% trt_ft_nc$feature[c(30:28, 23, 18, 10)]) %>%
left_join(trt_ft_nc)
ggplot(nc_comp, aes(x = treatment, y = intensity, color = signal)) +
geom_point() +
stat_summary(fun = 'mean', geom = 'point', color = 'black', size = 4, shape = 18) +
facet_wrap(~feature, ncol = 3) +
ggtitle('Component 1 Features of Interest')
trt_ft_bb <- mod3_ft %>%
filter(type == 'FT') %>%
arrange(abs(comp4)) %>%
tail(n = 7) %>%
mutate(signal = ifelse(comp4 > 0, 'broc', 'brus'))
bb_comp <- combined_tidy %>%
dplyr::filter(feature %in% trt_ft_bb$feature[c(1,3:7)]) %>%
left_join(trt_ft_bb)
ggplot(bb_comp, aes(x = treatment, y = intensity, color = signal)) +
geom_point() +
stat_summary(fun = 'mean', geom = 'point', color = 'black', size = 4, shape = 18) +
facet_wrap(~feature, ncol = 3) +
ggtitle('Component 4 Features of Interest')
trt_ft <- mod3_ft %>%
filter(type == 'FT')
As before, we see the same trend as before for component 1 (No Veg vs Combo) where typically features associated with no-veg have a slightly higher mean intensity in no-veg and a lowest overall mean intensity in combo. For component 4 (Brus vs Broc) there is a clear difference between features which are associated with broc and those associated with brus. Overall, 900 of the 925 metabolomic (97%) features from this model were identified as significantly different in the RMANOVA accounting for 15% of the total features found in the RMANOVA.
Is the high overlap between the features pulled from each model and those from GLMM/RMANOVA a red flag? Are my models biased and not selecting meaningful features?
I plan on conducting LOOCV to these models to evaluate feature importance in these models as a further validation step.
Once metabolite identification is done I also plan on searching the literature to find links between metabolites and microbes which are both associated with specific treatment groups, specifically if they have similar loadings to one another.
As a final analysis, I plan on using remMap to conduct a regularized regression analysis on this dataset. remMap utilized L1 and L2 penalization to generate a sparse model yielding betas for predictors that correspond to specific response variables. In this model, metabolomic features will be used as the predictors and ASVs will be used as a response variables (remMap requires p > q).
#Run on the shell since computationally intensive
#lamL1.v <- seq(1,11, by=1)
#lamL2.v <- seq(11,21, by=1)
#
#try3 <- remMap.CV(X=X_mat, Y=Y_mat,lamL1.v, lamL2.v, C.m=NULL, fold=5, seed=1)
#pick <- which.min(as.vector(try3$ols.cv))
#lamL1.pick <- try3$l.index[1,pick] ## find the optimal (LamL1,LamL2) based on the cv score
#lamL2.pick <- try3$l.index[2,pick]
#
#result <- remMap(X_mat, Y_mat,lamL1=lamL1.pick, lamL2=lamL2.pick, phi0=NULL, C.m=NULL)
result <- readRDS('./remMap/remMapRes.RDS')
final <- result$phi
rownames(final) <- colnames(ft_block)
colnames(final) <- colnames(asv_block)
final[apply(final, 1, sum) != 0, ]
## ASV1 ASV2 ASV4 ASV49 ASV68 ASV71 ASV72
## FT_4.23_907.2967_neg -0.012827447 0 0.008799967 0 0 0 0
## FT_4.57_269.9680_pos -0.017752231 0 0.013439985 0 0 0 0
## FT_7.28_570.9858_pos 0.007100841 0 -0.004511694 0 0 0 0
## ASV76 ASV83 ASV85 ASV89 ASV91 ASV93 ASV96 ASV130 ASV132
## FT_4.23_907.2967_neg 0 0 0 0 0 0 0 0 0
## FT_4.57_269.9680_pos 0 0 0 0 0 0 0 0 0
## FT_7.28_570.9858_pos 0 0 0 0 0 0 0 0 0
## ASV134 ASV141 ASV145 ASV148 ASV151 ASV155 ASV157 ASV168
## FT_4.23_907.2967_neg 0 0 0 0 0 0 0 0
## FT_4.57_269.9680_pos 0 0 0 0 0 0 0 0
## FT_7.28_570.9858_pos 0 0 0 0 0 0 0 0
## ASV172 ASV229 ASV241 ASV256 ASV277 ASV286 ASV291 ASV310
## FT_4.23_907.2967_neg 0 0 0 0 0 0 0 0
## FT_4.57_269.9680_pos 0 0 0 0 0 0 0 0
## FT_7.28_570.9858_pos 0 0 0 0 0 0 0 0
## ASV354 ASV398 ASV415 ASV418 ASV427 ASV439 ASV457 ASV461
## FT_4.23_907.2967_neg 0 0 0 0 0 0 0 0
## FT_4.57_269.9680_pos 0 0 0 0 0 0 0 0
## FT_7.28_570.9858_pos 0 0 0 0 0 0 0 0
## ASV462 ASV482 ASV487 ASV495 ASV501 ASV514 ASV556 ASV573
## FT_4.23_907.2967_neg 0 0 0 0 0 0 0 0
## FT_4.57_269.9680_pos 0 0 0 0 0 0 0 0
## FT_7.28_570.9858_pos 0 0 0 0 0 0 0 0
## ASV584 ASV625 ASV683 ASV805 ASV820 ASV872 ASV875 ASV892
## FT_4.23_907.2967_neg 0 0 0 0 0 0 0 0
## FT_4.57_269.9680_pos 0 0 0 0 0 0 0 0
## FT_7.28_570.9858_pos 0 0 0 0 0 0 0 0
## ASV914 ASV927 ASV934 ASV993 ASV1040 ASV1104 ASV1122
## FT_4.23_907.2967_neg 0 0 0 0 0 0 0
## FT_4.57_269.9680_pos 0 0 0 0 0 0 0
## FT_7.28_570.9858_pos 0 0 0 0 0 0 0
## ASV1167 ASV1263 ASV1303 ASV1487 ASV1584 ASV1609 ASV1736
## FT_4.23_907.2967_neg 0 0 0 0 0 0 0
## FT_4.57_269.9680_pos 0 0 0 0 0 0 0
## FT_7.28_570.9858_pos 0 0 0 0 0 0 0
## ASV1983 ASV2581
## FT_4.23_907.2967_neg 0 0
## FT_4.57_269.9680_pos 0 0
## FT_7.28_570.9858_pos 0 0
rmft <- rownames(final[apply(final, 1, sum) != 0, ])
spearman %>%
filter(padj <= 0.05) %>%
filter(feature %in% rmft) %>%
filter(ASV %in% c('ASV1', 'ASV4'))
## # A tibble: 6 × 7
## feature ASV data spm rho pval padj
## <chr> <chr> <list> <list> <dbl> <dbl> <dbl>
## 1 FT_4.23_907.2967_neg ASV1 <tibble [40 × 10]> <htest> -0.828 4.30e-11 8.99e-7
## 2 FT_4.23_907.2967_neg ASV4 <tibble [40 × 10]> <htest> 0.647 6.32e- 6 1.94e-3
## 3 FT_4.57_269.9680_pos ASV1 <tibble [40 × 10]> <htest> -0.779 3.28e- 9 8.40e-6
## 4 FT_4.57_269.9680_pos ASV4 <tibble [40 × 10]> <htest> 0.698 5.47e- 7 3.40e-4
## 5 FT_7.28_570.9858_pos ASV1 <tibble [40 × 10]> <htest> 0.753 2.13e- 8 3.01e-5
## 6 FT_7.28_570.9858_pos ASV4 <tibble [40 × 10]> <htest> -0.552 2.22e- 4 1.96e-2
RemMap gave us a highly sparse model with only 3 features not being shrunk to 0. Additionally, all these features had fairly small betas. Unsurprisingly, the ASVs that remMap was found to be influence were solely ASV1 and ASV4 which represent family Enterobacteriaceae and Clostridiaceae, respectively. Unsurprisingly, in our spearman analysis we found that all three features had significant correlations with the two ASVs, the direction of which matched the sign of the betas.
RemMap has two strategies to determine the shrinkage parameter, one based on the shrinked estimater (RSS) and based on the unshrinked estimated (RSS). OLS is recommended as RSS tends to yield larger models so we will try that next.
#Run on the shell since computationally intensive
#lamL1.v <- seq(1,11, by=1)
#lamL2.v <- seq(11,21, by=1)
#
#try3 <- remMap.CV(X=X_mat, Y=Y_mat,lamL1.v, lamL2.v, C.m=NULL, fold=5, seed=1)
#pick <- which.min(as.vector(try3$rss.cv))
#lamL1.pick <- try3$l.index[1,pick] ## find the optimal (LamL1,LamL2) based on the cv score
#lamL2.pick <- try3$l.index[2,pick]
#
#result <- remMap(X_mat, Y_mat,lamL1=lamL1.pick, lamL2=lamL2.pick, phi0=NULL, C.m=NULL)
remMapRSS <- readRDS('./remMap/remMapRes_rss.RDS')
final_rss <- remMapRSS$phi
rownames(final_rss) <- colnames(ft_block)
colnames(final_rss) <- colnames(asv_block)
final_rss[apply(final_rss, 1, sum) != 0, ]
## ASV1 ASV2 ASV4 ASV49 ASV68 ASV71 ASV72
## FT_4.23_907.2967_neg -2.126679e-02 0 0.0152492388 0 0 0 0
## FT_5.37_161.0060_neg -5.484552e-04 0 0.0005679717 0 0 0 0
## FT_22.87_166.0858_pos 4.064068e-07 0 0.0000000000 0 0 0 0
## FT_4.57_269.9680_pos -2.022381e-02 0 0.0159197539 0 0 0 0
## FT_7.28_570.9858_pos 3.206878e-03 0 -0.0021141747 0 0 0 0
## FT_4.20_675.3031_pos -1.780895e-03 0 0.0013044583 0 0 0 0
## ASV76 ASV83 ASV85 ASV89 ASV91 ASV93 ASV96 ASV130 ASV132
## FT_4.23_907.2967_neg 0 0 0 0 0 0 0 0 0
## FT_5.37_161.0060_neg 0 0 0 0 0 0 0 0 0
## FT_22.87_166.0858_pos 0 0 0 0 0 0 0 0 0
## FT_4.57_269.9680_pos 0 0 0 0 0 0 0 0 0
## FT_7.28_570.9858_pos 0 0 0 0 0 0 0 0 0
## FT_4.20_675.3031_pos 0 0 0 0 0 0 0 0 0
## ASV134 ASV141 ASV145 ASV148 ASV151 ASV155 ASV157 ASV168
## FT_4.23_907.2967_neg 0 0 0 0 0 0 0 0
## FT_5.37_161.0060_neg 0 0 0 0 0 0 0 0
## FT_22.87_166.0858_pos 0 0 0 0 0 0 0 0
## FT_4.57_269.9680_pos 0 0 0 0 0 0 0 0
## FT_7.28_570.9858_pos 0 0 0 0 0 0 0 0
## FT_4.20_675.3031_pos 0 0 0 0 0 0 0 0
## ASV172 ASV229 ASV241 ASV256 ASV277 ASV286 ASV291 ASV310
## FT_4.23_907.2967_neg 0 0 0 0 0 0 0 0
## FT_5.37_161.0060_neg 0 0 0 0 0 0 0 0
## FT_22.87_166.0858_pos 0 0 0 0 0 0 0 0
## FT_4.57_269.9680_pos 0 0 0 0 0 0 0 0
## FT_7.28_570.9858_pos 0 0 0 0 0 0 0 0
## FT_4.20_675.3031_pos 0 0 0 0 0 0 0 0
## ASV354 ASV398 ASV415 ASV418 ASV427 ASV439 ASV457 ASV461
## FT_4.23_907.2967_neg 0 0 0 0 0 0 0 0
## FT_5.37_161.0060_neg 0 0 0 0 0 0 0 0
## FT_22.87_166.0858_pos 0 0 0 0 0 0 0 0
## FT_4.57_269.9680_pos 0 0 0 0 0 0 0 0
## FT_7.28_570.9858_pos 0 0 0 0 0 0 0 0
## FT_4.20_675.3031_pos 0 0 0 0 0 0 0 0
## ASV462 ASV482 ASV487 ASV495 ASV501 ASV514 ASV556 ASV573
## FT_4.23_907.2967_neg 0 0 0 0 0 0 0 0
## FT_5.37_161.0060_neg 0 0 0 0 0 0 0 0
## FT_22.87_166.0858_pos 0 0 0 0 0 0 0 0
## FT_4.57_269.9680_pos 0 0 0 0 0 0 0 0
## FT_7.28_570.9858_pos 0 0 0 0 0 0 0 0
## FT_4.20_675.3031_pos 0 0 0 0 0 0 0 0
## ASV584 ASV625 ASV683 ASV805 ASV820 ASV872 ASV875 ASV892
## FT_4.23_907.2967_neg 0 0 0 0 0 0 0 0
## FT_5.37_161.0060_neg 0 0 0 0 0 0 0 0
## FT_22.87_166.0858_pos 0 0 0 0 0 0 0 0
## FT_4.57_269.9680_pos 0 0 0 0 0 0 0 0
## FT_7.28_570.9858_pos 0 0 0 0 0 0 0 0
## FT_4.20_675.3031_pos 0 0 0 0 0 0 0 0
## ASV914 ASV927 ASV934 ASV993 ASV1040 ASV1104 ASV1122
## FT_4.23_907.2967_neg 0 0 0 0 0 0 0
## FT_5.37_161.0060_neg 0 0 0 0 0 0 0
## FT_22.87_166.0858_pos 0 0 0 0 0 0 0
## FT_4.57_269.9680_pos 0 0 0 0 0 0 0
## FT_7.28_570.9858_pos 0 0 0 0 0 0 0
## FT_4.20_675.3031_pos 0 0 0 0 0 0 0
## ASV1167 ASV1263 ASV1303 ASV1487 ASV1584 ASV1609 ASV1736
## FT_4.23_907.2967_neg 0 0 0 0 0 0 0
## FT_5.37_161.0060_neg 0 0 0 0 0 0 0
## FT_22.87_166.0858_pos 0 0 0 0 0 0 0
## FT_4.57_269.9680_pos 0 0 0 0 0 0 0
## FT_7.28_570.9858_pos 0 0 0 0 0 0 0
## FT_4.20_675.3031_pos 0 0 0 0 0 0 0
## ASV1983 ASV2581
## FT_4.23_907.2967_neg 0 0
## FT_5.37_161.0060_neg 0 0
## FT_22.87_166.0858_pos 0 0
## FT_4.57_269.9680_pos 0 0
## FT_7.28_570.9858_pos 0 0
## FT_4.20_675.3031_pos 0 0
rmft_rss <- rownames(rmft_rss <- final_rss[apply(final_rss, 1, sum) != 0, ])
spearman %>%
filter(padj <= 0.05) %>%
filter(feature %in% rmft_rss) %>%
filter(ASV %in% c('ASV1', 'ASV4'))
## # A tibble: 10 × 7
## feature ASV data spm rho pval padj
## <chr> <chr> <list> <list> <dbl> <dbl> <dbl>
## 1 FT_4.23_907.2967_neg ASV1 <tibble [40 × 10]> <htest> -0.828 4.30e-11 8.99e-7
## 2 FT_4.23_907.2967_neg ASV4 <tibble [40 × 10]> <htest> 0.647 6.32e- 6 1.94e-3
## 3 FT_5.37_161.0060_neg ASV1 <tibble [40 × 10]> <htest> -0.710 2.94e- 7 2.20e-4
## 4 FT_5.37_161.0060_neg ASV4 <tibble [40 × 10]> <htest> 0.715 2.24e- 7 1.80e-4
## 5 FT_4.57_269.9680_pos ASV1 <tibble [40 × 10]> <htest> -0.779 3.28e- 9 8.40e-6
## 6 FT_4.57_269.9680_pos ASV4 <tibble [40 × 10]> <htest> 0.698 5.47e- 7 3.40e-4
## 7 FT_7.28_570.9858_pos ASV1 <tibble [40 × 10]> <htest> 0.753 2.13e- 8 3.01e-5
## 8 FT_7.28_570.9858_pos ASV4 <tibble [40 × 10]> <htest> -0.552 2.22e- 4 1.96e-2
## 9 FT_4.20_675.3031_pos ASV1 <tibble [40 × 10]> <htest> -0.785 1.99e- 9 5.70e-6
## 10 FT_4.20_675.3031_pos ASV4 <tibble [40 × 10]> <htest> 0.655 4.57e- 6 1.57e-3
Out of this model, we see 3 new features being selected in our model compared to the first iteration. Once again, all the selected features have significant spearman correlations, except for FT_22.87_166.0858 which also has the smallest beta of the selected features.
Would it be beneficial to run other regression analyses such as elastic net? remMap cannot deal with mixed models, as we have here, would using a package like lmmen or glmmLasso which create a mixed model be more useful than remMap?
Feature identification for the metabolomics data is currently on going and we are currently focusing on the features output from the GLMM and RMANOVA models for identification. Once these are completed I plan to begin drafting a manuscript of this data with a goal for the first draft being done by the end of winter term.